<h1 align="center"> WorFBench </h1>
<h3 align="center"> BENCHMARKING AGENTIC WORKFLOW GENERATION </h3>

## Table of Contents

- 🌟[Overview](#🌟overview)
- 🔧[Installation](#🔧installation)
- 🏋️[Model-Training](#🏋️model-training)
- ✏️[Model-Inference](#✏️model-inference)
- 📝[Workflow-Generation](#📝workflow-generation)
- 🤔[Workflow-Evaluation](#🤔workflow-evaluation)

---

**The test set of our benchmark locates in the `gold_traj` folder.**

**The training set of ourr benchmark locates in the `Llama-Factory/data` folder.**



## 🌟Overview

Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a crucial step in this process. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent's workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.




## 🔧Installation

```bash
git clone https://github.com/xxx/WorFBench
cd WorFBench
pip install -r requirements.txt
```

## 🏋️Model-Training
We use [llama-facotry](https://github.com/hiyouga/LLaMA-Factory) to train model to generate workflow
```bash
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

llamafactory-cli train examples/train_full/internlm2_5_full_sft_ds3.yaml
llamafactory-cli train examples/train_full/qwen2_full_sft_ds3.yaml
```

## ✏️Model-Deployment

We use [llama-facotry](https://github.com/hiyouga/LLaMA-Factory) to deploy local model with OpenAI-style API
```bash
cd LLaMA-Factory
API_PORT=8000 llamafactory-cli api examples/inference/llama3_vllm.yaml
```


## 📝Workflow-Generation
Generate workflow with local llm api
```bash
tasks=(wikihow toolbench toolalpaca lumos alfworld webshop os)
model_name=your_model_name
for task in ${tasks[@]}; do
    python node_eval.py \
        --task gen_workflow \
        --model_name ${model_name} \
        --gold_path ./gold_traj/${task}/graph_eval.json \
        --pred_path ./pred_traj/${task}/${model_name}/graph_eval_two_shot.json\
        --task_type ${task} \
        --few_shot \

done
```

## 🤔Workflow-Evaluation
Evaluation the workflow in the mode of *node* or *graph*
```bash
tasks=(wikihow toolbench toolalpaca lumos alfworld webshop os)
model_name=your_model_name
for task in ${tasks[@]}; do
    python node_eval.py \
        --task eval_workflow \
        --model_name ${model_name} \
        --gold_path ./gold_traj/${task}/graph_eval.json \
        --pred_path ./pred_traj/${task}/${model_name}/graph_eval_two_shot.json\
        --eval_model all-mpnet-base-v2 \
        --eval_output ./eval_result/${model_name}_${task}_graph_eval_two_shot.json \
        --eval_type node \
        --task_type ${task} \

done
```

